{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Datasets for the Content-based Filter\n",
    "\n",
    "This notebook builds the data you will use for creating our content based model. You'll collect the data via a collection of SQL queries from the publicly available Kurier.at dataset in BigQuery.\n",
    "Kurier.at is an Austrian newsite. The goal of these labs is to recommend an article for a visitor to the site. In this notebook, you collect the data for training, in the subsequent notebook you train the recommender model. \n",
    "\n",
    "This notebook illustrates:\n",
    "* How to pull data from BigQuery table and write to local files.\n",
    "* How to make reproducible train and test splits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "from google.cloud import bigquery \n",
    "\n",
    "PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID\n",
    "BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME\n",
    "REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n",
    "\n",
    "# do not change these\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['REGION'] = REGION\n",
    "os.environ['TFVERSION'] = '2.1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\n",
      "Updated property [compute/region].\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud  config  set project $PROJECT\n",
    "gcloud config set compute/region $REGION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will use this helper function to write lists containing article ids, categories, and authors for each article in our database to local file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def write_list_to_disk(my_list, filename):\n",
    "  with open(filename, 'w') as f:\n",
    "    for item in my_list:\n",
    "        line = \"%s\\n\" % item\n",
    "        f.write(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pull data from BigQuery\n",
    "\n",
    "The cell below creates a local text file containing all the article ids (i.e. 'content ids') in the dataset. \n",
    "\n",
    "Have a look at the original dataset in [BigQuery](https://console.cloud.google.com/bigquery?p=cloud-training-demos&d=GA360_test&t=ga_sessions_sample). Then read through the query below and make sure you understand what it is doing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Some sample content IDs ['299922662', '299826775', '299437612']\n",
      "The total number of articles is 15634\n"
     ]
    }
   ],
   "source": [
    "sql=\"\"\"\n",
    "#standardSQL\n",
    "\n",
    "SELECT  \n",
    "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id \n",
    "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   \n",
    "  UNNEST(hits) AS hits\n",
    "WHERE \n",
    "  # only include hits on pages\n",
    "  hits.type = \"PAGE\"\n",
    "  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n",
    "GROUP BY\n",
    "  content_id\n",
    "  \n",
    "\"\"\"\n",
    "\n",
    "content_ids_list = bigquery.Client().query(sql).to_dataframe()['content_id'].tolist()\n",
    "write_list_to_disk(content_ids_list, \"content_ids.txt\")\n",
    "print(\"Some sample content IDs {}\".format(content_ids_list[:3]))\n",
    "print(\"The total number of articles is {}\".format(len(content_ids_list)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There should be 15,634 articles in the database.  \n",
    "Next, you'll create a local file which contains a list of article categories and a list of article authors.\n",
    "\n",
    "Note the change in the index when pulling the article category or author information. Also, you are using the first author of the article to create our author list.  \n",
    "Refer back to the original dataset, use the `hits.customDimensions.index` field to verify the correct index.\t "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['News', 'Stars & Kultur', 'Lifestyle']\n"
     ]
    }
   ],
   "source": [
    "sql=\"\"\"\n",
    "#standardSQL\n",
    "SELECT  \n",
    "  (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category  \n",
    "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   \n",
    "  UNNEST(hits) AS hits\n",
    "WHERE \n",
    "  # only include hits on pages\n",
    "  hits.type = \"PAGE\"\n",
    "  AND (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n",
    "GROUP BY   \n",
    "  category\n",
    "\"\"\"\n",
    "categories_list = bigquery.Client().query(sql).to_dataframe()['category'].tolist()\n",
    "write_list_to_disk(categories_list, \"categories.txt\")\n",
    "print(categories_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The categories are 'News', 'Stars & Kultur', and 'Lifestyle'.  \n",
    "When creating the author list, you'll only use the first author information for each article.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Some sample authors ['Marlene Patsalidis', 'Yvonne Widler', 'Thomas  Trescher', 'Johanna Hager', 'Elisabeth Spitzer', 'Daniela Wahl', 'Stefan Berndl', 'Heinz Wagner', 'Brigitte Schokarth', 'Mathias Kainz']\n",
      "The total number of authors is 385\n"
     ]
    }
   ],
   "source": [
    "sql=\"\"\"\n",
    "#standardSQL\n",
    "SELECT\n",
    "  REGEXP_EXTRACT((SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)), r\"^[^,]+\")  AS first_author  \n",
    "FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   \n",
    "  UNNEST(hits) AS hits\n",
    "WHERE \n",
    "  # only include hits on pages\n",
    "  hits.type = \"PAGE\"\n",
    "  AND (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n",
    "GROUP BY   \n",
    "  first_author\n",
    "\"\"\"\n",
    "authors_list = bigquery.Client().query(sql).to_dataframe()['first_author'].tolist()\n",
    "write_list_to_disk(authors_list, \"authors.txt\")\n",
    "print(\"Some sample authors {}\".format(authors_list[:10]))\n",
    "print(\"The total number of authors is {}\".format(len(authors_list)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There should be 385 authors in the database. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create train and test sets\n",
    "\n",
    "In this section, you will create the train/test split of our data for training our model. You use the concatenated values for visitor id and content id to create a farm fingerprint, taking approximately 90% of the data for the training set and 10% for the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visitor_id</th>\n",
       "      <th>content_id</th>\n",
       "      <th>category</th>\n",
       "      <th>title</th>\n",
       "      <th>author</th>\n",
       "      <th>months_since_epoch</th>\n",
       "      <th>next_content_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1013445690169368902</td>\n",
       "      <td>299827911</td>\n",
       "      <td>News</td>\n",
       "      <td>\"Vulkanausbrüche sind normal\"</td>\n",
       "      <td>Michaela Reibenwein</td>\n",
       "      <td>574</td>\n",
       "      <td>299779564</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1013445690169368902</td>\n",
       "      <td>299779564</td>\n",
       "      <td>Stars &amp; Kultur</td>\n",
       "      <td>Geschenk: Nicole Kidman bekommt Traumhaus um 4...</td>\n",
       "      <td>Elisabeth Spitzer</td>\n",
       "      <td>574</td>\n",
       "      <td>299777664</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1022059616427871901</td>\n",
       "      <td>299798467</td>\n",
       "      <td>Lifestyle</td>\n",
       "      <td>Frau täuscht Tod vor um Fake-Liebhaber zu entk...</td>\n",
       "      <td>Elisabeth Mittendorfer</td>\n",
       "      <td>574</td>\n",
       "      <td>299777082</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1022059616427871901</td>\n",
       "      <td>299777082</td>\n",
       "      <td>Lifestyle</td>\n",
       "      <td>Die simple Strategie für strahlende Model-Haut</td>\n",
       "      <td>Maria Zelenko</td>\n",
       "      <td>574</td>\n",
       "      <td>299814775</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1029992987987017563</td>\n",
       "      <td>299779564</td>\n",
       "      <td>Stars &amp; Kultur</td>\n",
       "      <td>Geschenk: Nicole Kidman bekommt Traumhaus um 4...</td>\n",
       "      <td>Elisabeth Spitzer</td>\n",
       "      <td>574</td>\n",
       "      <td>299826775</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            visitor_id content_id        category  \\\n",
       "0  1013445690169368902  299827911            News   \n",
       "1  1013445690169368902  299779564  Stars & Kultur   \n",
       "2  1022059616427871901  299798467       Lifestyle   \n",
       "3  1022059616427871901  299777082       Lifestyle   \n",
       "4  1029992987987017563  299779564  Stars & Kultur   \n",
       "\n",
       "                                               title                  author  \\\n",
       "0                      \"Vulkanausbrüche sind normal\"     Michaela Reibenwein   \n",
       "1  Geschenk: Nicole Kidman bekommt Traumhaus um 4...       Elisabeth Spitzer   \n",
       "2  Frau täuscht Tod vor um Fake-Liebhaber zu entk...  Elisabeth Mittendorfer   \n",
       "3     Die simple Strategie für strahlende Model-Haut           Maria Zelenko   \n",
       "4  Geschenk: Nicole Kidman bekommt Traumhaus um 4...       Elisabeth Spitzer   \n",
       "\n",
       "   months_since_epoch next_content_id  \n",
       "0                 574       299779564  \n",
       "1                 574       299777664  \n",
       "2                 574       299777082  \n",
       "3                 574       299814775  \n",
       "4                 574       299826775  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sql=\"\"\"\n",
    "WITH site_history as (\n",
    "  SELECT\n",
    "      fullVisitorId as visitor_id,\n",
    "      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,\n",
    "      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, \n",
    "      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,\n",
    "      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,\n",
    "      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,\n",
    "      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions\n",
    "  FROM \n",
    "    `cloud-training-demos.GA360_test.ga_sessions_sample`,   \n",
    "     UNNEST(hits) AS hits\n",
    "   WHERE \n",
    "     # only include hits on pages\n",
    "      hits.type = \"PAGE\"\n",
    "      AND\n",
    "      fullVisitorId IS NOT NULL\n",
    "      AND\n",
    "      hits.time != 0\n",
    "      AND\n",
    "      hits.time IS NOT NULL\n",
    "      AND\n",
    "      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n",
    ")\n",
    "SELECT\n",
    "  visitor_id,\n",
    "  content_id,\n",
    "  category,\n",
    "  REGEXP_REPLACE(title, r\",\", \"\") as title,\n",
    "  REGEXP_EXTRACT(author_list, r\"^[^,]+\") as author,\n",
    "  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,\n",
    "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id\n",
    "FROM\n",
    "  site_history\n",
    "WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL\n",
    "      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) < 9\n",
    "\"\"\"\n",
    "training_set_df = bigquery.Client().query(sql).to_dataframe()\n",
    "training_set_df.to_csv('training_set.csv', header=False, index=False, encoding='utf-8')\n",
    "training_set_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visitor_id</th>\n",
       "      <th>content_id</th>\n",
       "      <th>category</th>\n",
       "      <th>title</th>\n",
       "      <th>author</th>\n",
       "      <th>months_since_epoch</th>\n",
       "      <th>next_content_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1017181701779883231</td>\n",
       "      <td>299813480</td>\n",
       "      <td>Lifestyle</td>\n",
       "      <td>Alice Schwarzer: Periode des Rückschlags für F...</td>\n",
       "      <td>Elisabeth Mittendorfer</td>\n",
       "      <td>574</td>\n",
       "      <td>299800661</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1044735394411897753</td>\n",
       "      <td>299584180</td>\n",
       "      <td>News</td>\n",
       "      <td>Paaradox: Berlin bei Nacht</td>\n",
       "      <td>Gabriele Kuhn</td>\n",
       "      <td>574</td>\n",
       "      <td>299821998</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1044735394411897753</td>\n",
       "      <td>299821998</td>\n",
       "      <td>News</td>\n",
       "      <td>Dieses Wort</td>\n",
       "      <td>Peter Pisa</td>\n",
       "      <td>574</td>\n",
       "      <td>299787238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1044735394411897753</td>\n",
       "      <td>299584180</td>\n",
       "      <td>News</td>\n",
       "      <td>Paaradox: Berlin bei Nacht</td>\n",
       "      <td>Gabriele Kuhn</td>\n",
       "      <td>574</td>\n",
       "      <td>299575760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1125519382829212847</td>\n",
       "      <td>296772080</td>\n",
       "      <td>News</td>\n",
       "      <td>5 vor 12 am Semmering: Neues Tragseil lässt ho...</td>\n",
       "      <td>Patrick Wammerl</td>\n",
       "      <td>574</td>\n",
       "      <td>296772080</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            visitor_id content_id   category  \\\n",
       "0  1017181701779883231  299813480  Lifestyle   \n",
       "1  1044735394411897753  299584180       News   \n",
       "2  1044735394411897753  299821998       News   \n",
       "3  1044735394411897753  299584180       News   \n",
       "4  1125519382829212847  296772080       News   \n",
       "\n",
       "                                               title                  author  \\\n",
       "0  Alice Schwarzer: Periode des Rückschlags für F...  Elisabeth Mittendorfer   \n",
       "1                         Paaradox: Berlin bei Nacht           Gabriele Kuhn   \n",
       "2                                        Dieses Wort              Peter Pisa   \n",
       "3                         Paaradox: Berlin bei Nacht           Gabriele Kuhn   \n",
       "4  5 vor 12 am Semmering: Neues Tragseil lässt ho...         Patrick Wammerl   \n",
       "\n",
       "   months_since_epoch next_content_id  \n",
       "0                 574       299800661  \n",
       "1                 574       299821998  \n",
       "2                 574       299787238  \n",
       "3                 574       299575760  \n",
       "4                 574       296772080  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sql=\"\"\"\n",
    "WITH site_history as (\n",
    "  SELECT\n",
    "      fullVisitorId as visitor_id,\n",
    "      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,\n",
    "      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, \n",
    "      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,\n",
    "      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,\n",
    "      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,\n",
    "      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions\n",
    "  FROM \n",
    "    `cloud-training-demos.GA360_test.ga_sessions_sample`,   \n",
    "     UNNEST(hits) AS hits\n",
    "   WHERE \n",
    "     # only include hits on pages\n",
    "      hits.type = \"PAGE\"\n",
    "      AND\n",
    "      fullVisitorId IS NOT NULL\n",
    "      AND\n",
    "      hits.time != 0\n",
    "      AND\n",
    "      hits.time IS NOT NULL\n",
    "      AND\n",
    "      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL\n",
    ")\n",
    "SELECT\n",
    "  visitor_id,\n",
    "  content_id,\n",
    "  category,\n",
    "  REGEXP_REPLACE(title, r\",\", \"\") as title,\n",
    "  REGEXP_EXTRACT(author_list, r\"^[^,]+\") as author,\n",
    "  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,\n",
    "  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id\n",
    "FROM\n",
    "  site_history\n",
    "WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL\n",
    "      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) >= 9\n",
    "\"\"\"\n",
    "test_set_df = bigquery.Client().query(sql).to_dataframe()\n",
    "test_set_df.to_csv('test_set.csv', header=False, index=False, encoding='utf-8')\n",
    "test_set_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the two csv files you just created containing the training and test set. You'll also do a line count of both files to confirm that you have achieved an approximate 90/10 train/test split.  \n",
    "In the next notebook, **Content Based Filtering** you will build a model to recommend an article given information about the current article being read, such as the category, title, author, and publish date. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   25599 test_set.csv\n",
      "  232308 training_set.csv\n",
      "  257907 total\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "wc -l *_set.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==> test_set.csv <==\n",
      "1017181701779883231,299813480,Lifestyle,Alice Schwarzer: Periode des Rückschlags für Frauen,Elisabeth Mittendorfer,574,299800661\n",
      "1044735394411897753,299584180,News,Paaradox: Berlin bei Nacht,Gabriele Kuhn,574,299821998\n",
      "1044735394411897753,299821998,News,Dieses Wort,Peter Pisa,574,299787238\n",
      "1044735394411897753,299584180,News,Paaradox: Berlin bei Nacht,Gabriele Kuhn,574,299575760\n",
      "1125519382829212847,296772080,News,5 vor 12 am Semmering: Neues Tragseil lässt hoffen,Patrick Wammerl,574,296772080\n",
      "1908943635382619435,299912041,News,Deutscher Bürgermeister bei Messerangriff schwer verletzt,,574,299791583\n",
      "2023876788459819379,299898438,News,USA: Oberstes Bundesgericht nimmt Klagen zu Waffengesetzen nicht an,,574,299853016\n",
      "2041575691631100228,299964154,News,Neue Seidenstraße: Ein chinesischer Keil in Europa,Hermann Sileitsch-Parzer,574,299826775\n",
      "2090601016705774782,299954138,News,\"Augsburg-Boss: \"\"Leipzig darf keine Lizenz haben\"\"\",Mathias Kainz,574,299903877\n",
      "2347336822873602099,299768605,Lifestyle,Benetton holt sich wieder Schock-Fotograf Oliviero Toscani ins Boot,Maria Zelenko,574,298324599\n",
      "\n",
      "==> training_set.csv <==\n",
      "1013445690169368902,299827911,News,\"\"\"Vulkanausbrüche sind normal\"\"\",Michaela Reibenwein,574,299779564\n",
      "1013445690169368902,299779564,Stars & Kultur,Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar ,Elisabeth Spitzer,574,299777664\n",
      "1022059616427871901,299798467,Lifestyle,Frau täuscht Tod vor um Fake-Liebhaber zu entkommen,Elisabeth Mittendorfer,574,299777082\n",
      "1022059616427871901,299777082,Lifestyle,Die simple Strategie für strahlende Model-Haut,Maria Zelenko,574,299814775\n",
      "1029992987987017563,299779564,Stars & Kultur,Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar ,Elisabeth Spitzer,574,299826775\n",
      "1029992987987017563,299777722,Stars & Kultur,\"Willow Smith: \"\"Es ist schrecklich so aufzuwachsen\"\"\",Elisabeth Spitzer,574,299772450\n",
      "1031539128969021923,299918278,News,Skipässe in Wintersport-Hochburgen massiv teurer,Stefan Hofer,574,299928807\n",
      "1031539128969021923,299928807,News,Missbrauchsvorwürfe: Tiroler Skiverband durchsuchte Heim-Protokolle,Mirad Odobasic,574,299816215\n",
      "1031539128969021923,299925086,News,Marihuana-Adventkalender findet in Kanada reißenden Absatz,,574,299814194\n",
      "1031539128969021923,299814194,News,WM-Qualifikation: Härtetest auf Mallorca für Österreichs Damen,Günther Pavlovics,574,299913368\n"
     ]
    }
   ],
   "source": [
    "!head *_set.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "python3",
   "name": "tf-gpu.1-15.m91",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf-gpu.1-15:m91"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
